
HYPERFLEET-854 - feat: implement hard deletion for clusters and nodepools #119

Open
mliptak0 wants to merge 2 commits into openshift-hyperfleet:main from mliptak0:hyperfleet-api-854

Conversation


@mliptak0 mliptak0 commented Apr 30, 2026

implement hard deletion for clusters and nodepools

Summary

Test Plan

  • Unit tests added/updated
  • make test-all passes
  • make lint passes
  • Helm chart changes validated with make test-helm (if applicable)
  • Deployed to a development cluster and verified (if Helm/config changes)
  • E2E tests passed (if cross-component or major changes)

Summary by CodeRabbit

  • New Features

    • Hard deletion of cluster and node pool resources after soft deletion when all required adapters are finalized and no child resources exist
    • New intermediate status condition that manages resources waiting for child resources to be removed during the deletion process
  • Changed

    • Improved condition timestamp handling to properly preserve state transitions across status updates

@openshift-ci openshift-ci Bot requested review from aredenba-rh and vkareh April 30, 2026 13:45

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vkareh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai Bot commented Apr 30, 2026

Walkthrough

This pull request introduces hard-deletion semantics for Cluster and NodePool resources, enabling permanent removal after soft deletion when all required adapter statuses report Finalized=True and no child resources remain. Key changes:

  • An IsFinalized() method on AdapterStatus for condition checking
  • AdapterStatusDao.Upsert() refactored to accept a pre-fetched existing record, plus a new DeleteByResource capability
  • Cluster and NodePool DAOs extended with locked reads (GetForUpdate) and targeted status updates (SaveStatusConditions)
  • Service-layer logic for cluster and node pool processing refactored to lock rows, validate finalization state, and conditionally hard-delete via cascading deletions
  • Aggregation enhanced with a new WaitingForChildResources intermediate condition state during deletion
  • Integration tests validating hard-deletion behavior across cluster and node pool resources
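
The walkthrough mentions a new IsFinalized() helper on AdapterStatus. As a rough sketch only, assuming a Kubernetes-style condition list (the field names and the Finalized constant are guesses, not the actual definitions in pkg/api/adapter_status_types.go), the check could look like this:

// Hypothetical types; the real AdapterStatus in pkg/api/adapter_status_types.go may differ.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

type AdapterStatus struct {
	Adapter    string
	Conditions []Condition
}

// IsFinalized reports whether the adapter has published a Finalized=True condition.
func (a *AdapterStatus) IsFinalized() bool {
	for _, c := range a.Conditions {
		if c.Type == "Finalized" && c.Status == "True" {
			return true
		}
	}
	return false
}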

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ClusterService
    participant ClusterDAO
    participant AdapterStatusDAO
    participant DB

    rect rgba(100, 150, 200, 0.5)
        Note over Client,DB: Hard-Delete Flow: Cluster Soft-Deletion
    end

    Client->>ClusterService: ProcessAdapterStatus(clusterID, adapterStatus)
    activate ClusterService
    
    ClusterService->>ClusterDAO: GetForUpdate(clusterID)
    activate ClusterDAO
    ClusterDAO->>DB: SELECT * FROM clusters WHERE id=? FOR UPDATE
    DB-->>ClusterDAO: cluster row (soft-deleted)
    ClusterDAO-->>ClusterService: *Cluster
    deactivate ClusterDAO

    ClusterService->>AdapterStatusDAO: FindByResource(clusterID)
    activate AdapterStatusDAO
    AdapterStatusDAO->>DB: SELECT * FROM adapter_statuses WHERE resourceID=?
    DB-->>AdapterStatusDAO: all adapter statuses
    AdapterStatusDAO-->>ClusterService: []AdapterStatus
    deactivate AdapterStatusDAO

    ClusterService->>ClusterService: validateAndClassify(adapterStatus)
    ClusterService->>AdapterStatusDAO: Upsert(ctx, adapterStatus, existing)
    activate AdapterStatusDAO
    AdapterStatusDAO->>DB: UPDATE/INSERT adapter_statuses
    DB-->>AdapterStatusDAO: updated AdapterStatus
    AdapterStatusDAO-->>ClusterService: *AdapterStatus
    deactivate AdapterStatusDAO

    alt All Required Adapters Finalized AND NO Child NodePools
        ClusterService->>AdapterStatusDAO: DeleteByResource(clusterID)
        activate AdapterStatusDAO
        AdapterStatusDAO->>DB: DELETE FROM adapter_statuses WHERE resourceID=?
        DB-->>AdapterStatusDAO: success
        deactivate AdapterStatusDAO

        ClusterService->>ClusterDAO: Delete(clusterID)
        activate ClusterDAO
        ClusterDAO->>DB: DELETE FROM clusters WHERE id=?
        DB-->>ClusterDAO: success
        deactivate ClusterDAO
        
        ClusterService-->>Client: cluster hard-deleted

    else Child NodePools Exist OR Adapters Not Finalized
        ClusterService->>ClusterService: recomputeAndSaveClusterStatus(cluster, adapterStatuses)
        ClusterService->>ClusterDAO: SaveStatusConditions(clusterID, conditions)
        activate ClusterDAO
        ClusterDAO->>DB: UPDATE clusters SET status_conditions=? WHERE id=?
        DB-->>ClusterDAO: success
        deactivate ClusterDAO
        
        ClusterService-->>Client: cluster soft-deleted, waiting for finalization
    end

    deactivate ClusterService
sequenceDiagram
    participant Client
    participant AggregationService
    participant ConditionAggregator

    rect rgba(150, 100, 150, 0.5)
        Note over Client,ConditionAggregator: Condition Aggregation: Deletion with Child Resources
    end

    Client->>AggregationService: ComputeReconciled(adapterStatuses, deletedTime, hasChildResources=true)
    activate AggregationService

    AggregationService->>AggregationService: Check: all adapters finalized?
    
    alt deletedTime != nil AND allFinalized AND hasChildResources=true
        AggregationService->>ConditionAggregator: Create condition(Type=Reconciled, Status=False, Reason=WaitingForChildResources)
        ConditionAggregator-->>AggregationService: condition
        AggregationService-->>Client: Reconciled=False, Reason=WaitingForChildResources

    else deletedTime != nil AND allFinalized AND hasChildResources=false
        AggregationService->>ConditionAggregator: Create condition(Type=Reconciled, Status=True, Reason=AllAdaptersReconciled)
        ConditionAggregator-->>AggregationService: condition
        AggregationService-->>Client: Reconciled=True, Reason=AllAdaptersReconciled

    else All adapters not finalized
        AggregationService->>ConditionAggregator: Create condition(Type=Reconciled, Status=False, Reason=MissingRequiredAdapters)
        ConditionAggregator-->>AggregationService: condition
        AggregationService-->>Client: Reconciled=False, Reason=MissingRequiredAdapters
    end

    deactivate AggregationService
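
The second diagram reduces to a three-way decision for a soft-deleted resource. Here is a compact sketch of that branch logic, using the reason strings from the diagram (the actual computeReconciled in pkg/services/aggregation.go takes richer inputs such as the adapter statuses and generation):

// Sketch of the deletion-time Reconciled decision shown in the diagram above.
// The names are taken from the diagram, not from the implementation.
func reconciledForDeletion(allFinalized, hasChildResources bool) (status, reason string) {
	switch {
	case !allFinalized:
		return "False", "MissingRequiredAdapters"
	case hasChildResources:
		return "False", "WaitingForChildResources"
	default:
		return "True", "AllAdaptersReconciled"
	}
}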

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 61.54%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): The title clearly and specifically describes the main feature being implemented: hard deletion for clusters and nodepools, which aligns with the extensive changes across the codebase.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 12-13: The changelog contains invalid PR links using
"/pull/HYPERFLEET-854" which must be replaced with the numeric PR path
"/pull/119"; update both occurrences in CHANGELOG.md (the two lines referencing
HYPERFLEET-854 and the additional occurrence noted at lines 31-31) so each
markdown link points to
https://github.com/openshift-hyperfleet/hyperfleet-api/pull/119 instead of the
non-numeric form.

In `@pkg/dao/adapter_status.go`:
- Around line 98-100: When updateResult.RowsAffected == 0 in the adapter status
update flow, explicitly distinguish a stale-report (concurrent version mismatch)
from a row that was deleted between read and update: after detecting
RowsAffected == 0, re-query the DB for the row using the same key/ID from
existing (e.g., existing.ID) — if the re-read returns no row, return a not-found
error (or trigger recreate logic) instead of returning existing; if the re-read
returns a row whose version/fields differ from existing, treat that as a stale
update and return existing (or a conflict error) so callers know the update
didn’t persist. Ensure this logic sits where updateResult.RowsAffected is
checked (the function handling the update in adapter_status.go).

In `@pkg/dao/cluster.go`:
- Around line 97-105: In SaveStatusConditions
(sqlClusterDao.SaveStatusConditions) add a check after the Update to treat zero
affected rows as not-found: if result.RowsAffected == 0, call
db.MarkForRollback(ctx, gorm.ErrRecordNotFound) (or sql.ErrNoRows) and return an
appropriate not-found error instead of nil; keep the existing error handling for
result.Error unchanged so true DB errors still roll back.

In `@pkg/dao/node_pool.go`:
- Around line 60-68: In SaveStatusConditions, don't treat a zero-row update as
success: after calling g2.Model(&api.NodePool{}).Where(...).Update(...) inspect
result.RowsAffected and if it is 0 return a not-found error (and call
db.MarkForRollback(ctx, err) as appropriate) instead of returning nil; keep
existing behavior for result.Error non-nil. Use the SaveStatusConditions
function and result.RowsAffected to implement this check so concurrent
hard-delete/update races surface as not-found.

In `@pkg/services/adapter_status.go`:
- Around line 125-130: The lookup code around
s.adapterStatusDao.FindByResourceAndAdapter currently treats any error as a "not
found" by setting existing = nil, which can hide real DAO failures before
calling Upsert; update the logic in the function to only normalize to nil when
the DAO returns the explicit not-found sentinel (e.g., ErrNotFound or the dao's
not-found behavior) and to return the lookup error immediately for any other
error; locate the call to s.adapterStatusDao.FindByResourceAndAdapter(ctx,
adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter) and
replace the unconditional existing = nil on error with a conditional that checks
the error value/type and either sets existing = nil for not-found or returns the
error so Upsert sees only valid existing state.

In `@pkg/services/aggregation.go`:
- Around line 740-754: The helper allAdaptersFinalized currently only checks
adapterStatus.IsFinalized() and thus considers finalization from any generation;
change it to require Finalized==true at the current resource generation by
adding a generation parameter (e.g., currentGeneration int64) and only count
adapterStatus entries whose Generation equals that parameter and whose
IsFinalized() is true; update callers (cluster/nodepool hard-delete call sites)
to pass the current resource generation, mirroring the generation filtering
approach used by computeReconciled, so older finalization rows are ignored.

In `@pkg/services/node_pool.go`:
- Around line 272-279: After a successful hard delete in the node-pool deletion
path (when nodePool.DeletedTime != nil and tryHardDeleteNodePool returns true),
ensure we re-check/requeue the parent cluster using nodePool.OwnerID so the
cluster's hard-delete/status is recomputed now that the final child is gone;
update the same logic in the other occurrence (around lines 381-403) to also
trigger the parent cluster reconciliation instead of returning immediately (use
the existing cluster-reconciliation/enqueue helper on the service, e.g., the
method responsible for cluster status recomputation or enqueuing by ID).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: cdeffff4-37b5-409f-b5f2-8265a9d7ff7a

📥 Commits

Reviewing files that changed from the base of the PR and between 3622815 and a7969d3.

📒 Files selected for processing (17)
  • CHANGELOG.md
  • pkg/api/adapter_status_types.go
  • pkg/dao/adapter_status.go
  • pkg/dao/cluster.go
  • pkg/dao/mocks/cluster.go
  • pkg/dao/mocks/node_pool.go
  • pkg/dao/node_pool.go
  • pkg/services/adapter_status.go
  • pkg/services/aggregation.go
  • pkg/services/aggregation_test.go
  • pkg/services/cluster.go
  • pkg/services/cluster_test.go
  • pkg/services/node_pool.go
  • pkg/services/node_pool_test.go
  • pkg/services/status_helpers.go
  • test/integration/clusters_test.go
  • test/integration/node_pools_test.go

Comment thread CHANGELOG.md Outdated
Comment thread pkg/dao/adapter_status.go
Comment on lines 98 to 100
 if updateResult.RowsAffected == 0 {
-	return d.FindByResourceAndAdapter(ctx, adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter)
+	return existing, nil
 }

@coderabbitai coderabbitai Bot Apr 30, 2026

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Differentiate stale report from concurrently deleted row.

At Line 98, RowsAffected == 0 unconditionally returns existing. That is correct for stale reports, but incorrect when the record was deleted between read and update. In that case this method reports success with a non-persisted object. Please branch explicitly: stale → return existing; missing row → re-read/recreate or return not-found.
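
As a sketch of the branching this comment asks for, reusing the FindByResourceAndAdapter lookup already present in the DAO (whether it returns an error or a nil record on a genuine miss is an assumption here):

if updateResult.RowsAffected == 0 {
	// Re-read to tell a stale report apart from a row deleted between read and update.
	reread, findErr := d.FindByResourceAndAdapter(ctx, adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter)
	if findErr != nil {
		// The row is gone: surface not-found instead of claiming the write persisted.
		return nil, findErr
	}
	// The row still exists but this update changed nothing: the incoming report is stale.
	return reread, nil
}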

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/dao/adapter_status.go` around lines 98 - 100, When
updateResult.RowsAffected == 0 in the adapter status update flow, explicitly
distinguish a stale-report (concurrent version mismatch) from a row that was
deleted between read and update: after detecting RowsAffected == 0, re-query the
DB for the row using the same key/ID from existing (e.g., existing.ID) — if the
re-read returns no row, return a not-found error (or trigger recreate logic)
instead of returning existing; if the re-read returns a row whose version/fields
differ from existing, treat that as a stale update and return existing (or a
conflict error) so callers know the update didn’t persist. Ensure this logic
sits where updateResult.RowsAffected is checked (the function handling the
update in adapter_status.go).

Contributor Author

The scenario CodeRabbit describes (record deleted between read and update) cannot happen in this code. The callers of Upsert in ProcessAdapterStatus hold a SELECT ... FOR UPDATE lock on the parent resource (cluster/nodepool) for the duration of the transaction. No concurrent transaction can delete the adapter status while this one is in progress.
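
For context, a locked read of the kind described here is typically written with GORM's locking clause. This is a sketch of what the new GetForUpdate might look like, not the code from the PR; clause.Locking comes from gorm.io/gorm/clause and renders as SELECT ... FOR UPDATE:

func (d *sqlClusterDao) GetForUpdate(ctx context.Context, id string) (*api.Cluster, error) {
	g2 := (*d.sessionFactory).New(ctx)
	var cluster api.Cluster
	// Holds a row lock until the surrounding transaction commits or rolls back.
	if err := g2.Clauses(clause.Locking{Strength: "UPDATE"}).First(&cluster, "id = ?", id).Error; err != nil {
		return nil, err
	}
	return &cluster, nil
}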

@coderabbitai coderabbitai Bot

@mliptak0, that's a valid point — the SELECT ... FOR UPDATE lock on the parent resource serializes all concurrent writes for the duration of the transaction, which makes the delete-between-read-and-update race impossible in this context. I've noted the locking strategy for future reviews.


📝 Learnings were identified and not saved (knowledge base disabled).

Comment thread pkg/dao/cluster.go
Comment on lines +97 to +105
func (d *sqlClusterDao) SaveStatusConditions(ctx context.Context, id string, statusConditions []byte) error {
	g2 := (*d.sessionFactory).New(ctx)
	result := g2.Model(&api.Cluster{}).Where("id = ?", id).Update("status_conditions", statusConditions)
	if result.Error != nil {
		db.MarkForRollback(ctx, result.Error)
		return result.Error
	}
	return nil
}

@coderabbitai coderabbitai Bot Apr 30, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle zero-row status updates as not-found.

At Line 99, this returns success even when no row matches id (RowsAffected == 0). In hard-delete races, that silently drops a status write and hides state loss. Treat zero affected rows as not-found and roll back.

Suggested patch
 func (d *sqlClusterDao) SaveStatusConditions(ctx context.Context, id string, statusConditions []byte) error {
 	g2 := (*d.sessionFactory).New(ctx)
 	result := g2.Model(&api.Cluster{}).Where("id = ?", id).Update("status_conditions", statusConditions)
 	if result.Error != nil {
 		db.MarkForRollback(ctx, result.Error)
 		return result.Error
 	}
+	if result.RowsAffected == 0 {
+		err := gorm.ErrRecordNotFound
+		db.MarkForRollback(ctx, err)
+		return err
+	}
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/dao/cluster.go` around lines 97 - 105, In SaveStatusConditions
(sqlClusterDao.SaveStatusConditions) add a check after the Update to treat zero
affected rows as not-found: if result.RowsAffected == 0, call
db.MarkForRollback(ctx, gorm.ErrRecordNotFound) (or sql.ErrNoRows) and return an
appropriate not-found error instead of nil; keep the existing error handling for
result.Error unchanged so true DB errors still roll back.

Contributor Author

False positive for the same reason as the Upsert comment.

In the ProcessAdapterStatus flow (the code path added by this PR), the cluster is locked with SELECT ... FOR UPDATE at the start of the transaction, preventing a concurrent hard-delete. For the existing UpdateClusterStatusFromAdapters path (which doesn't hold a lock), silently dropping a status condition write for a concurrently hard-deleted cluster is harmless: the cluster no longer exists, so persisting its conditions serves no purpose.

@coderabbitai coderabbitai Bot

@mliptak0, understood — that reasoning is sound. The SELECT ... FOR UPDATE lock in ProcessAdapterStatus eliminates the race in the new code path, and in UpdateClusterStatusFromAdapters a silent no-op on an already-hard-deleted cluster is semantically correct. I'll retract the comment.


📝 Learnings were identified and not saved (knowledge base disabled).

Comment thread pkg/dao/node_pool.go
Comment thread pkg/services/adapter_status.go
Comment on lines +125 to +130
existing, findErr := s.adapterStatusDao.FindByResourceAndAdapter(
ctx, adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter,
)
if findErr != nil {
existing = nil
}

@coderabbitai coderabbitai Bot

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate unexpected lookup failures before upserting.

This treats every FindByResourceAndAdapter error as “not found”. If that read fails for any other reason, the code still proceeds into Upsert, which hides the DAO failure and can take the wrong branch for freshness handling. Only normalize an actual miss to nil; return other errors immediately.
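
A sketch of the suggested handling; the exact not-found sentinel returned by the DAO (gorm.ErrRecordNotFound below) and the way the service wraps errors are assumptions:

existing, findErr := s.adapterStatusDao.FindByResourceAndAdapter(
	ctx, adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter,
)
if findErr != nil {
	if errors.Is(findErr, gorm.ErrRecordNotFound) { // stdlib errors.Is
		existing = nil // a genuine miss: let Upsert take the insert path
	} else {
		return nil, findErr // any other DAO failure must not be masked as not-found
	}
}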

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/adapter_status.go` around lines 125 - 130, The lookup code
around s.adapterStatusDao.FindByResourceAndAdapter currently treats any error as
a "not found" by setting existing = nil, which can hide real DAO failures before
calling Upsert; update the logic in the function to only normalize to nil when
the DAO returns the explicit not-found sentinel (e.g., ErrNotFound or the dao's
not-found behavior) and to return the lookup error immediately for any other
error; locate the call to s.adapterStatusDao.FindByResourceAndAdapter(ctx,
adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter) and
replace the unconditional existing = nil on error with a conditional that checks
the error value/type and either sets existing = nil for not-found or returns the
error so Upsert sees only valid existing state.

Comment thread pkg/services/aggregation.go
Comment on lines +740 to +754
func allAdaptersFinalized(requiredAdapters []string, adapterStatuses api.AdapterStatusList) bool {
	finalizedAdapters := make(map[string]struct{})

	for _, adapterStatus := range adapterStatuses {
		if adapterStatus.IsFinalized() {
			finalizedAdapters[adapterStatus.Adapter] = struct{}{}
		}
	}

	for _, requiredAdapter := range requiredAdapters {
		if _, exists := finalizedAdapters[requiredAdapter]; !exists {
			return false
		}
	}
	return true

@coderabbitai coderabbitai Bot

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require Finalized=True at the current generation.

allAdaptersFinalized only checks IsFinalized(). After soft-delete bumps the resource generation, an older Finalized=True row still counts here, so cluster/nodepool hard-delete can run before every required adapter has finalized the delete generation. This helper should filter on the current generation, the same way computeReconciled does.

Possible fix
-func allAdaptersFinalized(requiredAdapters []string, adapterStatuses api.AdapterStatusList) bool {
+func allAdaptersFinalized(
+	requiredAdapters []string,
+	currentGeneration int32,
+	adapterStatuses api.AdapterStatusList,
+) bool {
 	finalizedAdapters := make(map[string]struct{})

 	for _, adapterStatus := range adapterStatuses {
-		if adapterStatus.IsFinalized() {
+		if adapterStatus.ObservedGeneration == currentGeneration && adapterStatus.IsFinalized() {
 			finalizedAdapters[adapterStatus.Adapter] = struct{}{}
 		}
 	}

You'll also need to pass the resource generation from the cluster/nodepool hard-delete call sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/aggregation.go` around lines 740 - 754, The helper
allAdaptersFinalized currently only checks adapterStatus.IsFinalized() and thus
considers finalization from any generation; change it to require Finalized==true
at the current resource generation by adding a generation parameter (e.g.,
currentGeneration int64) and only count adapterStatus entries whose Generation
equals that parameter and whose IsFinalized() is true; update callers
(cluster/nodepool hard-delete call sites) to pass the current resource
generation, mirroring the generation filtering approach used by
computeReconciled, so older finalization rows are ignored.

Comment thread pkg/services/node_pool.go
Comment on lines +272 to 279
if nodePool.DeletedTime != nil {
	hardDeleted, hdErr := s.tryHardDeleteNodePool(ctx, nodePoolID, conditions, updatedStatuses)
	if hdErr != nil {
		return nil, hdErr
	}
	if hardDeleted {
		return upsertedStatus, nil
	}

@coderabbitai coderabbitai Bot Apr 30, 2026

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Re-check the owning cluster after hard-deleting the last node pool.

Once tryHardDeleteNodePool succeeds, this flow returns immediately. If the parent cluster was already soft-deleted and its adapters had finalized earlier, nothing reevaluates that cluster when the final child disappears, so it can stay stranded until another cluster status update happens. The new cluster integration test currently works around this by reposting a cluster adapter after nodepool deletion. Use nodePool.OwnerID to trigger the parent cluster’s hard-delete/status recomputation here.

Also applies to: 381-403
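
A rough illustration of the idea at this call site; recomputeClusterStatus is a hypothetical helper standing in for whatever cluster-reconciliation entry point the service could expose, not an existing method:

if hardDeleted {
	// Hypothetical: nudge the parent cluster so its own hard-delete check re-runs
	// now that its last child node pool is gone.
	if svcErr := s.recomputeClusterStatus(ctx, nodePool.OwnerID); svcErr != nil {
		return nil, svcErr
	}
	return upsertedStatus, nil
}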

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/node_pool.go` around lines 272 - 279, After a successful hard
delete in the node-pool deletion path (when nodePool.DeletedTime != nil and
tryHardDeleteNodePool returns true), ensure we re-check/requeue the parent
cluster using nodePool.OwnerID so the cluster's hard-delete/status is recomputed
now that the final child is gone; update the same logic in the other occurrence
(around lines 381-403) to also trigger the parent cluster reconciliation instead
of returning immediately (use the existing cluster-reconciliation/enqueue helper
on the service, e.g., the method responsible for cluster status recomputation or
enqueuing by ID).

Contributor Author

@mliptak0 mliptak0 Apr 30, 2026

The gap is real: after the last nodepool is hard-deleted, nothing automatically triggers the cluster's hard-delete check. The integration test, though, covers this by manually re-posting a cluster adapter status.

However, fixing this in this PR is a heavy lift because NodePoolService doesn't have a reference to ClusterService; adding one creates a cross-service dependency and a potential circular dependency. The current design is event-driven: Sentinel periodically triggers adapter status updates, so the cluster will eventually get re-evaluated. This is probably a good case to be covered by e2e tests.

@coderabbitai coderabbitai Bot

@mliptak0, that makes sense — the circular-dependency concern is a real constraint, and the Sentinel-driven eventual consistency model is a reasonable trade-off here. The integration test's manual re-post approach is a fair workaround for now.

Would you like me to open a follow-up GitHub issue to track the eventual-consistency gap (cluster hard-delete not triggered automatically after last node pool is removed) so it's not lost? It could also serve as a reminder to add E2E coverage for that scenario.

Comment thread pkg/dao/node_pool.go
func (d *sqlNodePoolDao) ExistsByOwner(ctx context.Context, ownerID string) (bool, error) {
	g2 := (*d.sessionFactory).New(ctx)
	var count int64
	if err := g2.Model(&api.NodePool{}).Where("owner_id = ?", ownerID).Limit(1).Count(&count).Error; err != nil {

Contributor

It could be simplified using raw SQL: SELECT EXISTS(SELECT 1 FROM node_pools WHERE owner_id = ?)
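
In GORM, the reviewer's EXISTS form could be expressed with a raw query; a sketch only, with the node_pools table name inferred from the model:

func (d *sqlNodePoolDao) ExistsByOwner(ctx context.Context, ownerID string) (bool, error) {
	g2 := (*d.sessionFactory).New(ctx)
	var exists bool
	// Raw + Scan lets the database answer the existence question directly.
	err := g2.Raw("SELECT EXISTS(SELECT 1 FROM node_pools WHERE owner_id = ?)", ownerID).Scan(&exists).Error
	if err != nil {
		return false, err
	}
	return exists, nil
}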

Comment thread pkg/services/cluster.go
// If already marked, it returns the cluster unchanged. Cascades the deletion timestamp to all child nodepools.
// Actual removal is handled by adapters detecting the new generation and triggering hard deletion asynchronously.
func (s *sqlClusterService) SoftDelete(ctx context.Context, id string) (*api.Cluster, *errors.ServiceError) {
	cluster, err := s.clusterDao.Get(ctx, id)

Contributor

This isn't part of this PR, but since we have a Save() method which updates from the model, would it be better to use GetForUpdate here?

Comment thread pkg/services/node_pool.go
}

func (s *sqlNodePoolService) SoftDelete(ctx context.Context, id string) (*api.NodePool, *errors.ServiceError) {
	nodePool, err := s.nodePoolDao.Get(ctx, id)

Contributor

Same here, maybe using GetForUpdate would be better?
